Statistical Inference

Chelsea Parlett-Pelleriti

The Problem Of Inference

Review

  • Description

  • Inference

  • Prediction

Why Do We Need Inference?

In your groups: You want to know whether a new anti-acid reduces people’s heartburn.

Think about how you might go about using data to answer this question.

Definition

Statistical Inference is using information/data from a sample to draw conclusions about a population

\(\mathbf{X} = \left(X_1, X_2, ... X_n \right)\) is a sample of data from a distribution \(P_{\theta}\). We want to use \(\mathbf{X}\) to learn about \(P_{\theta}\) since we can’t directly observe \(P_{\theta}\).

Populations can be thought of as:

  • groups of existing “experimental units”
  • A Data Generating Process (DGP)

Populations

Think about the heights of all people in the USA named Michael.

  • Group of Existing People: All 3.28M Michaels

  • DGP for Michaels: the theoretical process that creates Michaels

Intuition

I claim that I’m faster at crosswords than you. We both do one crossword:

  • my time: 25m 05s

  • your time: 25m 23s


Am I right? What would it take to convince you I’m right?


Inference: First Problem

What do I really mean when I say that I’m faster at crosswords than you?

Inference: First Problem

Statistics are functions of data that summarize the data, \(T(\mathbf{X})\).

  • Population Statistic: \(T(\mathbf{X})\); where \(\mathbf{X}\) is the random variable (e.g. mean height of people named Michael)

  • Sample Statistic: \(T(\mathbf{x})\); where \(\mathbf{x}\) is a realized sample of \(\mathbf{X}\) (e.g. mean of 100 randomly sampled heights of people named Michael)

Inference: First Problem

Sample Mean: \(\frac{1}{N} \sum_{i=1}^N x_i\)

(or: 67th percentile, min, max, variance, z-statistic…)

  • Pros: statistics summarize information about the data in an easily digestible way
  • Cons: they often throw out information

Inference: First Problem

When choosing a statistic, you’re implicitly agreeing that two samples are the same if \(T(\mathbf{x}) = T(\mathbf{y})\)
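A quick way to see this trade-off: two very different samples can share the same value of \(T\) when \(T\) is the sample mean. A minimal Python sketch (the data below are made up for illustration):

```python
# T(x): the sample mean as a statistic -- it summarizes, but throws out spread.
def T(x):
    return sum(x) / len(x)

sample_1 = [24.0, 25.0, 26.0]  # tightly clustered
sample_2 = [5.0, 25.0, 45.0]   # widely spread
print(T(sample_1))  # 25.0
print(T(sample_2))  # 25.0 -- "the same" under this statistic
```

Both samples have mean 25, so the sample mean treats them as identical even though their spreads are wildly different.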

Statistics

New York Subway System

Back to Crosswords

I claim that my mean crossword time is faster than yours:

\[ \mu_{me} > \mu_{you} \]


Note: here, let’s think of our mean times as the DGP that generates our observed times:

\[ \text{Chelsea}_i \sim N \left (\mu_{me}, \sigma_{me} \right) \\ \text{You}_i \sim N \left (\mu_{you}, \sigma_{you} \right) \]
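This DGP can be simulated to build intuition; a sketch with hypothetical parameter values (in reality \(\mu_{me}\) and \(\mu_{you}\) are unobservable):

```python
import random

random.seed(3)

# Hypothetical population parameters -- never directly observed in practice.
mu_me, sigma_me = 25.0, 3.0    # minutes
mu_you, sigma_you = 25.5, 3.0

# One crossword each is a single draw from each distribution:
chelsea_time = random.gauss(mu_me, sigma_me)
your_time = random.gauss(mu_you, sigma_you)

# A single draw can easily come out "backwards" relative to the true means;
# averaging many repeated draws recovers mu_me.
n = 10_000
chelsea_mean = sum(random.gauss(mu_me, sigma_me) for _ in range(n)) / n
```

With only one crossword each, the observed ordering of times is weak evidence about the ordering of \(\mu_{me}\) and \(\mu_{you}\).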

Back to Crosswords

But I can’t possibly observe \(\mu_{me}\) and \(\mu_{you}\)

Sample \(\to\) Population

If we were doing description, the sample mean alone would accomplish our goal.

  • what is the mean crossword time?
    • my time: 25m 05s
    • your time: 25m 23s


But if doing inference, we want to generalize. We don’t want to know about \(\bar{x}\), we want to know about \(\mu\).

Note: we often use Greek letters to denote population statistics and Roman letters for sample statistics.

Sample \(\to\) Population

New Goal: Formalize a process to get from Sample statistics to Population statistics

e.g. what can I learn about the mean height of college students, \(\mu\), from a random sample mean \(\bar{x}\)?

Estimands, Estimator, Estimates

via Peter Tennant at https://x.com/PWGTennant/status/1164084443742691328

Estimands, Estimator, Estimates

  • Estimand: a target quantity to be estimated

    • \(\mu\) mean height of all people named Michael in the US
  • Estimator: a function, \(W(\mathbf{x})\) that is a recipe about how to get an estimate from a sample

    • \(\bar{x} = W(\mathbf{heights}) = \frac{1}{N} \sum_{i=1}^N height_i\)
  • Estimate: a realized value of \(W(\mathbf{x})\) applied to an actual sample, \(\mathbf{x}\)

    • \(\bar{x} = W(176,177,175,179,173) = 176\)
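The three terms map cleanly onto code; a sketch using the sample from the slide:

```python
# Estimator: a recipe, W, that maps any sample to a number.
def W(x):
    # sample mean: (1/N) * sum of x_i
    return sum(x) / len(x)

# Estimand: the unobservable population mean mu (not in the code at all).
# Estimate: W applied to one realized sample.
heights = [176, 177, 175, 179, 173]
estimate = W(heights)
print(estimate)  # 176.0
```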

Choosing an Estimator

Finding an Estimator

Sometimes, finding an estimator feels intuitive (e.g. using sample mean to estimate population mean) but remember, someone at some point had to figure out that the sample mean was a good estimator.

  1. Method of Moments

  2. Maximum Likelihood Estimators

    • Expectation Maximization (EM)

Finding an Estimator: MOM

Method of Moments

Set the first \(k\) sample moments equal to the first \(k\) population moments, then solve for the parameters.

\[ \underbrace{m_1}_\text{1st samp moment} = \overbrace{\mu'_1}^\text{1st population moment} \\ m_2 = \mu'_2 \\ ... \\ m_k = \mu'_k \\ \]

Review: Moments

Moments of a distribution are expectations.

\[ \mu'_n = \mathbb{E}X^n \]

Central Moments replace \(X\) with the mean centered value \(X-\mu\).

\[ \mu_n = \mathbb{E}(X-\mu)^n \]

Finding an Estimator: MOM

Method of Moments

Remember:

  • \(p^{th}\) sample moment: \(\frac{1}{n} \sum_{i=1}^n X_i^p\)

  • \(p^{th}\) population moment: \(\mathbb{E}(X^p)\)

Finding an Estimator: MOM

Method of Moments

Let’s say \(x \sim \mathcal{N}(\mu, \sigma^2)\), \(k = 2\)

  • first moment: \(\frac{1}{n} \sum_{i=1}^n x_i= \mathbb{E}(X) = \mu\)

    • \(\hat{\mu} = \bar{x}\)
  • second moment: \(\frac{1}{n} \sum_{i=1}^n x_i^2 = \mathbb{E}(X^2) = \mu^2 + \sigma^2\)

    • \(\frac{1}{n} \sum_{i=1}^n x_i^2 = \bar{x}^2 + \hat{\sigma}^2\)

    • \(\hat{\sigma}^2 = \left [\frac{1}{n} \sum_{i=1}^n x_i^2 \right] - \bar{x}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2\)
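These two MoM estimators are easy to check by simulation; a minimal sketch with hypothetical true values:

```python
import random

random.seed(0)
mu_true, sigma_true = 10.0, 2.0  # hypothetical; chosen for the demo
x = [random.gauss(mu_true, sigma_true) for _ in range(10_000)]

n = len(x)
mu_hat = sum(x) / n                                  # first sample moment
sigma2_hat = sum(xi**2 for xi in x) / n - mu_hat**2  # m_2 - m_1^2

# mu_hat should land near 10 and sigma2_hat near 4.
```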

Finding an Estimator: MOM

okay but that felt too easy…

Finding an Estimator: MOM

A gamma distribution has two parameters, \(\alpha\) and \(\theta\). Let’s use MoM to find estimators for them.

\[ f(x; \alpha, \theta) = \frac{1}{\Gamma(\alpha)\theta^\alpha} x^{\alpha-1} e^{-x/\theta} \]

Finding an Estimator: MOM

First moment: \[ \mathbb{E}(X_i) = \mu = \alpha\theta \] Second central moment: \[ Var(X_i) = \mathbb{E}\left[ (X_i - \mu)^2\right ] = \alpha\theta^2 \]

Finding an Estimator: MOM

Next, we set these equal to the sample moments: \[ \mathbb{E}(X_i) = \mu = \alpha\theta = \underbrace{\frac{1}{n}\sum x_i}_\text{sample mean} = \bar{x} \]

\[ Var(X_i) = \mathbb{E}\left[ (X_i - \mu)^2\right ] = \alpha\theta^2 = \underbrace{\frac{1}{n}\sum (x_i-\bar{x})^2}_\text{sample var} \]

Finding an Estimator: MOM

Now solve for each parameter!

Try it yourself!

Finding an Estimator: MOM

Now solve for each parameter!

\(\alpha\):

\[ \alpha\theta = \underbrace{\frac{1}{n}\sum x_i}_\text{sample mean} \\ \alpha = \frac{1}{n\theta}\sum x_i = \frac{\bar{x}}{\theta} \]

Finding an Estimator: MOM

Now solve for each parameter!

using \(\alpha = \frac{\bar{x}}{\theta}\) to sub into the variance equation…

\[ \alpha\theta^2 = \frac{\bar{x}}{\theta}\theta^2 = \bar{x}\theta = \underbrace{\frac{1}{n}\sum (x_i-\bar{x})^2}_\text{sample var} \rightarrow \\ \hat{\theta} = \underbrace{\frac{1}{n\bar{x}}\sum (x_i-\bar{x})^2}_\text{MoM estimator} \]

Finding an Estimator: MOM

Now plug this back into our original equation to get \(\alpha\):

\[ \hat{\alpha} = \frac{\bar{x}}{\hat{\theta}} = \frac{\bar{x}}{\frac{1}{n\bar{x}}\sum (x_i-\bar{x})^2} \]

Finding an Estimator: MOM

We’ve found the estimators!!!!

\[ \hat{\alpha} = \frac{\bar{x}}{\frac{1}{n\bar{x}}\sum (x_i-\bar{x})^2} \\ \hat{\theta} = \frac{1}{n\bar{x}}\sum (x_i-\bar{x})^2 \]
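The gamma estimators can be sanity-checked the same way; a sketch with hypothetical shape and scale values:

```python
import random

random.seed(1)
alpha_true, theta_true = 3.0, 2.0  # hypothetical shape and scale
x = [random.gammavariate(alpha_true, theta_true) for _ in range(50_000)]

n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / n  # (1/n) * sum (x_i - xbar)^2

theta_hat = s2 / xbar          # MoM: theta-hat = sample var / sample mean
alpha_hat = xbar / theta_hat   # MoM: alpha-hat = xbar / theta-hat
```

Both estimates should land close to the true shape 3 and scale 2.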

hoorah.

Finding an Estimator: MLE

Maximum Likelihood Estimation

\[ \text{arg}\,\max\limits_{\theta} \mathcal{L}(\theta|x) \]

The estimate of \(\theta\) is the one that maximizes the likelihood of the data, \(x\).

Maximum Likelihood Estimation

Maximum Likelihood Estimation

\[ p(x | \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x- \mu)^2}{2\sigma^2}} \]

Where \(\theta = (\mu, \sigma)\). We want to choose the value of \(\theta\) that maximizes the likelihood of the data, \(x\).

Maximum Likelihood Estimation

from Bayes Rules!

Maximum Likelihood Estimation

For a single data point, the value of the likelihood function, \(\mathcal{L}\left( \theta | x \right)\), is:

\[ \mathcal{L} \left( \theta | x\right) = p(x| \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x- \mu)^2}{2\sigma^2}} \]

If data points in a sample are independent, the likelihood value for all data points is simply the product of their individual likelihood values, since \(p(A,B) = p(A)*p(B) \text{ iff } A \mathrel{\unicode{x2AEB}} B\).
\[ \mathcal{L}\left(\theta | \mathbf{x} \right) = p(\mathbf{x} |\theta) = \prod_{i=1}^n p(x_i | \theta) = \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i- \mu)^2}{2\sigma^2}} \]

Maximum Likelihood Estimation

The higher the likelihood of our data, the more evidence that a particular \(\theta\) is a good fit for the data.

Maximum Likelihood Estimation

\[ \text{arg max}_{\theta} \left[ \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i- \mu)^2}{2\sigma^2}} \right] \]

to maximize, we:

  1. take the partial derivatives of \(\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i- \mu)^2}{2\sigma^2}}\) w.r.t. each element of \(\theta\)
  2. set each \(\frac{\partial}{\partial \theta_i} = 0\)
  3. solve (analytically, Expectation Maximization, Gradient Descent…) for \(\theta\)


But…

Maximum Likelihood Estimation

…taking the derivative of a function of products is hard, so we use log likelihood.

\[ \ell\left(\theta | \mathbf{x} \right) = \log\left(\mathcal{L}\left(\theta | \mathbf{x} \right)\right) = \\ -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log (\sigma^2) -\frac{1}{2 \sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]

Note: \(\log()\) is a monotonically increasing function, so the \(\theta\) that maximizes \(\ell\left(\theta | \mathbf{x} \right)\) will also maximize \(\mathcal{L}\left(\theta | \mathbf{x} \right)\)
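The monotonicity claim is easy to verify numerically: maximize both the likelihood and the log likelihood over a grid of candidate \(\mu\) values and confirm the argmax is the same. A sketch with a hypothetical sample and \(\sigma\) fixed at 1 for simplicity:

```python
import math

x = [4.8, 5.1, 5.3, 4.9, 5.4]  # hypothetical sample; xbar = 5.1

def log_lik(mu, sigma=1.0):
    n = len(x)
    return (-n / 2 * math.log(2 * math.pi)
            - n / 2 * math.log(sigma ** 2)
            - sum((xi - mu) ** 2 for xi in x) / (2 * sigma ** 2))

def lik(mu, sigma=1.0):
    return math.exp(log_lik(mu, sigma))

grid = [i / 100 for i in range(400, 600)]  # candidate mu values
best_loglik = max(grid, key=log_lik)
best_lik = max(grid, key=lik)
# Both land on the same grid point: the sample mean, 5.1.
```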

Maximum Likelihood Estimation

\[ \ell\left(\theta | \mathbf{x} \right) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log (\sigma^2) -\frac{1}{2 \sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]

Example with normal distribution:

  • \(\hat{\mu} : \frac{\partial}{\partial \mu} \ell(\theta | x) = 0\)

  • \(\hat{\sigma} : \frac{\partial}{\partial \sigma} \ell(\theta | x) = 0\)

Maximum Likelihood Estimation

Solution for Normal Distribution:

  • \(\hat{\mu} = \frac{1}{n} \sum_{i=1}^nx_i\)

  • \(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n(x_i - \hat{\mu})^2\)
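These closed-form solutions (which, for the normal, coincide with the method-of-moments estimators) can be verified by simulation; a sketch with hypothetical true values:

```python
import random

random.seed(2)
mu_true, sigma_true = 3.0, 1.5  # hypothetical true parameters
x = [random.gauss(mu_true, sigma_true) for _ in range(20_000)]

n = len(x)
mu_hat = sum(x) / n                                    # MLE of mu
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n   # MLE: 1/n, not 1/(n-1)
```

The \(1/n\) divisor makes the variance MLE slightly biased downward; the familiar \(1/(n-1)\) version is the unbiased correction.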